Descripcion base de datos

Titulo de la base de datos:

Forest Covertype data, Remote Sensing and GIS Program Department of Forest Sciences College of Natural Resources Colorado State University.

El dataset consiste en medidas cartograficas de areas silvestres en el Bosque Nacional Roosevelt en el norte de Colorado en USA y el tipo de covertura vegetal presente. Se realizan en celdas de 30 X 30 metros. Estos bosques tienen poca intervencion humana, por ende las areas caracterizan no usos (como agricultura, poblacion etc) sino clase de bosque predominante. En total son 12 medidas cartograficas y 7 grandes tipos de covertura vegetal.

El dataset:

Numero de columnas:

length(df)
[1] 55

Numero de observaciones:


nrow(df)
[1] 581011

Atributos:

names(df[,c(1:10,55)])
 [1] "Elevation"                          "Aspect"                            
 [3] "Slope"                              "Horizontal_Distance_To_Hydrology"  
 [5] "Vertical_Distance_To_Hydrology"     "Horizontal_Distance_To_Roadways"   
 [7] "Hillshade_9am"                      "Hillshade_Noon"                    
 [9] "Hillshade_3pm"                      "Horizontal_Distance_To_Fire_Points"
[11] "Forest_Cover_Type"                 

Otros atributos son cuatro areas silvestres y 40 tipos de suelo.

El tipo de atributo, las unidades y su descripcion:

Name Data Type Measurement Description
Elevation quantitative meters Elevation in meters
Aspect quantitative azimuth Aspect in degrees azimuth
Slope quantitative degrees Slope in degrees
Horizontal_Distance_To_Hydrology quantitative meters Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology quantitative meters Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways quantitative meters Horz Dist to nearest roadway
Hillshade_9am quantitative 0 to 255 index Hillshade index at 9am, summer solstice
Hillshade_Noon quantitative 0 to 255 index Hillshade index at noon, summer soltice
Hillshade_3pm quantitative 0 to 255 index Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points quantitative meters Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns) qualitative 0 (absence) or 1 (presence) Wilderness area designation
Soil_Type (40 binary columns) qualitative 0 (absence) or 1 (presence Soil Type designation
Cover_Type (7 types) integer 1 to 7 Forest Cover Type designation

Las cuatro areas silvestres:

  • Rawah Wilderness Area
  • Neota Wilderness Area
  • Comanche Peak Wilderness Area
  • Cache la Poudre Wilderness Area

Tipo de covertura forestal:

  • Spruce/Fir (picea -pino-/abeto)
  • Lodgepole Pine (pino)
  • Ponderosa Pine (pino)
  • Cottonwood/Willow (sauces)
  • Aspen (alamos)
  • Douglas-fir (pino oregon)
  • Krummholz (vegetación atrofiada y deformada, arbol “bandera”)

Missings

No hay missings.

colSums(is.na(df))
                         Elevation                             Aspect 
                                 0                                  0 
                             Slope   Horizontal_Distance_To_Hydrology 
                                 0                                  0 
    Vertical_Distance_To_Hydrology    Horizontal_Distance_To_Roadways 
                                 0                                  0 
                     Hillshade_9am                     Hillshade_Noon 
                                 0                                  0 
                     Hillshade_3pm Horizontal_Distance_To_Fire_Points 
                                 0                                  0 
                 Wilderness_Area_1                  Wilderness_Area_2 
                                 0                                  0 
                 Wilderness_Area_3                  Wilderness_Area_4 
                                 0                                  0 
                        Soil_Type1                         Soil_Type2 
                                 0                                  0 
                        Soil_Type3                         Soil_Type4 
                                 0                                  0 
                        Soil_Type5                         Soil_Type6 
                                 0                                  0 
                        Soil_Type7                         Soil_Type8 
                                 0                                  0 
                        Soil_Type9                        Soil_Type10 
                                 0                                  0 
                       Soil_Type11                        Soil_Type12 
                                 0                                  0 
                       Soil_Type13                        Soil_Type14 
                                 0                                  0 
                       Soil_Type15                        Soil_Type16 
                                 0                                  0 
                       Soil_Type17                        Soil_Type18 
                                 0                                  0 
                       Soil_Type19                        Soil_Type20 
                                 0                                  0 
                       Soil_Type21                        Soil_Type22 
                                 0                                  0 
                       Soil_Type23                        Soil_Type24 
                                 0                                  0 
                       Soil_Type25                        Soil_Type26 
                                 0                                  0 
                       Soil_Type27                        Soil_Type28 
                                 0                                  0 
                       Soil_Type29                        Soil_Type30 
                                 0                                  0 
                       Soil_Type31                        Soil_Type32 
                                 0                                  0 
                       Soil_Type33                        Soil_Type34 
                                 0                                  0 
                       Soil_Type35                        Soil_Type36 
                                 0                                  0 
                       Soil_Type37                        Soil_Type38 
                                 0                                  0 
                       Soil_Type39                        Soil_Type40 
                                 0                                  0 
                 Forest_Cover_Type 
                                 0 

###Variables categoricas###

Una de las variables consiste en medidas del tipo de suelo presente. En el dataset esta variable se mapeo a 40 one-hot variables. Cada variable corresponde a un tipo de suelo, catalogado con un codigo que incluye informacion de la zona climatica y zona geologica. La distribucion de esta variable es la siguiente :

barplot(sort(table(df_suelos$suelo)),las=2, log="y", xlab = "suelo", col = "#1b98e0")
png('barsuelo.png')
barplot(sort(table(df_suelos$suelo)),las=2, log="y", xlab = "suelo", col = "#1b98e0")

dev.off()
png 
  2 

El suelo mas abundante tiene el doble de frecuencia que el siguiente mas abundante y su frecuencia esta 4 ordenes de magnitud por encima del suelo menos frecuente. En vista de que son muchas variables individuales a ser consideradas y no pueden ser combinadas (no sin conocimiento experto en geologia) para reducir su numero. No sera considerada en el analisis.

Otra de las variables categoricas es la zona silvestre de la celda. La abundancia de zonas:

barplot(sort(table(df_areas$areas)),las=2, xlab = "areas silvestres", col = "#F4A582")

png('barareas.png')

barplot(sort(table(df_areas$areas)),las=2, xlab = "areas silvestres", col = "#F4A582")

dev.off()
png 
  2 

Las areas silvestres 3 y 4 son alrededor de 8 veces mas abundantes en el dataset que las 1 y 2. Tambien se podria usar para clasificar pero se escoge la clase de cobertura forestal.

Descripcion de la variable objetivo, las clases de cobertura vegetal:

library(plotly)

labels = c('Spruce/Fir','Lodgepole Pine','Ponderosa Pine','Cottonwood/Willow','Aspen','Douglas-fir',
          'Krummholz')
values <- aggregate(df$Elevation~ df$Forest_Cover_Type, data=df, FUN=length)
fig <- plot_ly(type='pie', labels=labels, values=values$`df$Elevation`, 
               textinfo='label+percent',
               insidetextorientation='radial',showlegend = FALSE)
fig


#png('covertype.png')

#fig

#dev.off()

#pie(table(df$Forest_Cover_Type),main='Tipos de covertura')
#barplot(table(df$World.region), xlab = "Region", ylab = "Cantidad de Ciudades",         main="Ciudades por region",cex.names = .3, las=2) 

Se puede observar que los datos estan bastante desbalanceados. Esto incidiría en una regresión pero no en un LDA.

Resumen de los datos:

df_max<-apply(df[,1:10],2,max)
df_min<-apply(df[,1:10],2,min)
df_n <- sweep(df[,1:10], 2, df_min, "-")
range <- df_max-df_min
df_n <- sweep(df_n, 2, range, "/")
boxplot(df_n,las=2)

Todas las distribuciones son sesgadas y tienen muchos outliers. Para clasificar, si se va a aplicar un LDA, hay que garantizar normalidad y que las medias sean distintas, pero se vera mas adelante. Tambien es importante garantizar que los atributos sean independientes.

La variable Aspect es bimodal y probablemente Slope es multimodal. Todas las variables tienen colas largas, excepto Hillshade_3pm. En todas las variables excepto en Elevation las clases siguen la misma tendencia. En esta variable las distribuciones para cada grupo estan centradas distinto:

ggplot2.multiplot(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10, cols=4)

png('histograms.png')

ggplot2.multiplot(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10, cols=3)

dev.off()
png 
  2 

Correlaciones

library(ggcorrplot)

corr <- round(cor(subset0[,1:10]), 1)
p.mat<- cor_pmat(subset0[,1:10])

ggcorrplot(corr, lab = TRUE,
     outline.col = "white",lab_size = 3, tl.cex=5,method = "circle",hc.order = TRUE, p.mat = p.mat)
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
png('correlation.png')

ggcorrplot(corr, lab = TRUE,
     outline.col = "white",lab_size = 3, tl.cex=5, method = "circle",hc.order = TRUE, p.mat = p.mat)
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
dev.off()
png 
  2 

Las variables hillshade estan relacionadas con aspect, slope y entre ellas. Esto tiene que ver con como se calculan las cantidades hillshade que son los sombreados que se dibujan sobre los mapas cartograficos. Se supone una iluminacion simulada que depende de la orientacion a la fuente de luz, la cual esta basada en las variables aspect y slope. Las cantidades distancia vertical y horizontal a cursos de agua tambien estan relacionadas y se calculan con slope y elevation. La distancia horizontal a caminos y a puntos de fuego tambien se calculan con la elevacion y estan relacionados entre si.

Con base en esto acoto las variables a Elevation, Aspect, Slope, Hillshade_noon, Horizontal_Distance_Roadways y Horizontal_Distance_to_Hydrology.

library(GGally)
library(data.table)

color <- as.factor(subset[,7])
ggpairs(subset[,1:5], aes(color = color, alpha = 0.5),lower = list(combo = "count"),upper = "blank",)

Se pueden observar tambien las distribuciones de las clases, son sesgadas como se vio en los boxplots. La variable que podria servir mas para distinguir clases podria ser Elevation.

subsubset <- subset[,c(1:4,6,8,11)]

Analisis explotario con PCA

library(ggbiplot)
library(plyr)

datos_pca <- prcomp(subset[,1:5,7],scale=TRUE)
p <- ggbiplot(datos_pca, obs.scale=0.01,alpha = 0.3,groups=subset$cover_type)
p <- p + xlim(-3, 2) + ylim(-2, 3)

plot(p)

png('pca.png')

plot(p)

dev.off()
png 
  2 

Clasificacion Supervisada

Linear Discriminant Analysis

El primer metodo a implementar es LDA. Para que tenga validez el metodo deben cumplirse los supuestos de normalidad multivariada para cada nivel de cada variable y de homocedasticidad.

Para el test de normalidad multivariada usamos test de Shapiro:

library(mvnormtest)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='1',1:6]))
shapitest <- test$p.value
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='2',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='3',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='4',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='5',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='6',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='7',1:6]))
shapitest <- append(shapitest,test$p.value)

shapitest
[1] 5.153787e-22 9.523226e-34 3.087307e-06 2.832202e-08 3.613977e-06 9.314272e-13 1.285806e-07

Rechaza normalidad para todos los niveles de todas las variables. Para el total:

library(mvnormtest)
mshapiro.test(t(subset[,1:6]))

    Shapiro-Wilk normality test

data:  Z
W = 0.86531, p-value < 2.2e-16

Si el nivel de significacion por defecto es de 0.05, rechaza la normalidad multivariada.

El siguiente supuesto de homocedasticidad lo evaluamos con el test M de Box:

library(biotools)
Loading required package: MASS
---
biotools version 4.2
boxM(data=subsubset[,1:6],grouping=subsubset[,7])

    Box's M-test for Homogeneity of Covariance Matrices

data:  subsubset[, 1:6]
Chi-Sq (approx.) = 1840.7, df = 126, p-value < 2.2e-16

Rechaza la homogeneidad de varianzas. Se tendria que realizar un LDA robusto (cuadratico).

Igual, a pesar de haber rechazado la normalidad hacemos el test de Hotelling para determinar si las medias de los grupos son distintas:

library(Hotelling)
Loading required package: corpcor
fitProd = hotelling.test(.~ cover_type, data = subsubset) 
fitProd
Test stat:  1813.2 
Numerator df:  6 
Denominator df:  4230 
P-value:  0 

Por ende rechazamos que las medias sean iguales. El cover_type es discriminante.

LDA

Habiendo determinado que la variable de clase es discriminante, calculamos el lda:


library(MASS)

modelo_lda <- lda(formula = cover_type ~ Elevation + Aspect + Slope + Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Roadways + Hillshade_Noon, data = subset)

Ahora se prueba el modelo:

predicciones <-  predict(object = modelo_lda, newdata = testeo, method = "predictive")
table(clase, predicciones$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2   3   4   5   6   7
         1 510 203   0   0   0   0   7
         2 199 791  11   0   0  10   0
         3   0  24  77   1   0   6   0
         4   0   0  10   1   0   0   0
         5   1  32   0   0   0   0   0
         6   0  20  45   0   0   8   0
         7  62   2   0   0   0   0   5

Los errores de prediccion:

training_error <- mean(clase != predicciones$class) * 100 
training_error 
[1] 31.25926

El error da bastante grande…

Con una variable mas:

library(klaR) 

testeo$clase <- as.factor(testeo$clase)

partimat(clase ~ ., data = testeo, method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7))

Sin esa variable:

library(klaR) 

testeo$clase <- as.factor(testeo$clase)

partimat(clase ~ ., data = testeo[,c(1:5,7)], method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),scaled=TRUE)

png('partiplot.png')

partimat(clase ~ ., data = testeo[,c(1:5,7)], method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),scaled=TRUE)

dev.off()
png 
  2 

Discrimina muy mal. Ahora, si transformo las variables con un log:

library(IDPmisc)

df_trans <- log(subset[,1:6])
subset_trans <- cbind(df_trans,cover_type)
subset_t <- NaRV.omit(subsubset_trans)
subset_t$cover_type <- as.factor(subset_t$cover_type)
library(MASS)

modelo_lda_trans <- lda(formula = cover_type ~. , data = subset_t)
library(IDPmisc)
library(dplyr)

test_trans <- log(test)
test_t <- cbind(test_trans,clase)
test_t <- NaRV.omit(test_t)
clase_t <-test_t$clase
predicciones_t <-  predict(object = modelo_lda_trans, newdata = test_t)
table(clase_t, predicciones_t$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2   3   4   5   6   7
         1   0 682   0   0   0   0   0
         2   0 969   0   0   0   0   0
         3   0 103   0   0   0   0   0
         4   0   8   0   0   0   0   0
         5   0  29   0   0   0   0   0
         6   0  68   0   0   0   0   0
         7   0  67   0   0   0   0   0
training_error_t <- mean(clase_t != predicciones_t$class) * 100 
training_error_t 
[1] 49.68847

Pesimo.

Si considero una variable mas, sin transformar:


library(MASS)

modelo_lda_2 <- lda(formula = cover_type ~. , data = subset_2)
predicciones_2 <-  predict(object = modelo_lda_2, newdata = testeo_2)
table(testeo_2$clase, predicciones_2$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2   3   4   5   6   7
         1 516 200   0   0   0   0   4
         2 201 790  11   0   0   9   0
         3   0  25  69   4   0  10   0
         4   0   0  10   1   0   0   0
         5   0  33   0   0   0   0   0
         6   0  19  44   0   0  10   0
         7  63   1   0   0   0   0   5
training_error_2 <- mean(testeo_2$clase != predicciones_2$class) * 100 
training_error_2 
[1] 31.30864

Considerar las variables con logaritmo o considerar una variable adicional no mejoro la prediccion. Se intentara ahora con un lda robusto:


library(MASS)

modelo_qda <- qda(formula = cover_type ~ ., data = subset)
predicciones_qda <-  predict(object = modelo_qda, newdata = testeo, method = "predictive")
table(clase, predicciones_qda$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2   3   4   5   6   7
         1  66  15   0   4 161  37 437
         2  92  66   1  30 572  86 164
         3   0   0   3  68   8  29   0
         4   0   0   0  11   0   0   0
         5   2   0   0   0  29   2   0
         6   0   0   4  37   7  25   0
         7   0   0   0   0   1   0  68
training_error_qda <- mean(clase != predicciones_qda$class) * 100 
training_error_qda
[1] 86.76543

Da peor…

Con las clases escaladas:

datos_escalados <- as.data.frame(scale(subset[,1:5]))
subset_esc <- cbind(datos_escalados,cover_type)
subset_esc$cover_type <- as.factor(subset_esc$cover_type)

test de Shapiro para los datos escalados:

subset_esc$cover_type <- as.factor(subset_esc$cover_type)
library(mvnormtest)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='1',1:5]))
shapitest <- test$p.value
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='2',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='3',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='4',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='5',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='6',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='7',1:5]))
shapitest <- append(shapitest,test$p.value)

shapitest
[1] 5.239585e-15 2.098729e-18 1.165629e-01 5.931935e-04 1.401489e-09 1.025656e-02 1.953662e-10

library(MASS)

modelo_lda_esc <- lda(formula = cover_type ~., data = subset_esc)
test_esc <- as.data.frame(scale(test))
test_esc <- cbind(test_esc,clase)
test_esc$clase <- as.factor(test_esc$clase)
predicciones_esc <-  predict(object = modelo_lda_esc, newdata = test_esc, method = "predictive")
table(clase, predicciones_esc$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2   3   4   5   6   7
         1 521 192   0   0   0   0   7
         2 216 774  11   0   0  10   0
         3   0  25  76   1   0   6   0
         4   0   0  10   1   0   0   0
         5   1  32   0   0   0   0   0
         6   0  21  43   0   0   9   0
         7  63   1   0   0   0   0   5
training_error_esc <- mean(clase != predicciones_esc$class) * 100 
training_error_esc
[1] 31.55556

El mejor resultado fue el inicial. En general se predicen muy mal las clases menos abundantes.

El qda con los datos escalados:


library(MASS)

modelo_qda_esc <- qda(formula = cover_type ~ ., data = subset[,1:5,7])
predicciones_qda_esc <-  predict(object = modelo_qda_esc, newdata = testeo[,1:5,7], method = "predictive")
table(clase, predicciones_qda_esc$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2   3   4   5   6   7
         1 115  21   0   2 168  10 404
         2 129  99   0  15 552  74 142
         3   0   0   6  63   8  31   0
         4   0   0   0  10   0   1   0
         5   2   0   0   0  29   2   0
         6   0   0   0  39   7  27   0
         7   0   0   0   0   1   0  68
training_errorqda_esc <- mean(clase != predicciones_qda_esc$class) * 100 
training_errorqda_esc
[1] 82.51852

Se realizara el analisis con solo dos clases:

subsubset2 <-subset[(subset$cover_type==2),]
subsubset1 <-subset[(subset$cover_type==1),]
subsubset <-rbind(subsubset1,subsubset2)
subsubset$cover_type <- droplevels(subsubset$cover_type)

El test de normalidad multivariada:

library(mvnormtest)

testing <- mshapiro.test(t(subsubset[subsubset$cover_type=='1',1:6]))
shapitest <- testing$p.value
testing <- mshapiro.test(t(subsubset[subsubset$cover_type=='2',1:6]))
shapitest <- append(shapitest,testing$p.value)

shapitest
[1] 2.929115e-30 3.038726e-36

Rechaza normalidad para cada variable para cada nivel de la variable.

Con las variables estandarizadas:

library(clusterSim)

#subsubset_es <- scale(subsubset[,1:6], center = median(subsubset))
subsubset_es <- data.Normalization (subsubset[,1:6],type="n2",normalization="column")
subsubset_es <- cbind(subsubset_es,subsubset[,7])

Con las variables estandarizadas y sacando hillshade:

library(mvnormtest)

subsubset_2 <- subsubset_es[,c(1:5,7)]
testing <- mshapiro.test(t(subsubset_2[subsubset_2$cover_type=='1',1:5]))
shapitest <- testing$p.value
testing <- mshapiro.test(t(subsubset_2[subsubset_2$cover_type=='2',1:5]))
shapitest <- append(shapitest,testing$p.value)

shapitest
[1] 5.239585e-15 2.098729e-18

Rechaza, sin sacar y sacando hillshade.

Testeando homocedasticidad:

library(biotools)

boxM(data=subsubset[,1:6],grouping=subsubset[,7])

Rechaza igualdad de varianzas.

Como el test M de Box es sensible a la falta de normalidad, realizamos el de Levene:

library(car)
Loading required package: carData

Attaching package: ‘car’

The following object is masked from ‘package:dplyr’:

    recode
leveneTest( Elevation + Aspect + Slope + Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Roadways + Hillshade_Noon ~ subsubset$cover_type, data = subsubset)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    1  24.874 6.365e-07 ***
      4268                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

El valor es menor que el nivel de significancia 0.001. Rechazamos la hipotesis nula y concluimos que las varianzas no son iguales.

De todas maneras hacemos el test de Hotelling para testear diferencia de medias:

library(Hotelling)

fitProd = hotelling.test(.~ subsubset$cover_type, data = subsubset) 
fitProd

Rechaza igualdad de medias, pero como no se cumple la normalidad multivariada este resultado no es confiable.

LDA con dos clases

el conjunto de test para dos clases:

library(dplyr)

testeo2 <-testeo[(testeo$clase==2),]
testeo1 <-testeo[(testeo$clase==1),]
testeo_2class <-rbind(testeo1,testeo2)
testeo_2class$clase <- droplevels(testeo_2class$clase)
clase_2class <- testeo_2class[,7]
test_2class <-testeo_2class[,1:6]

library(MASS)

modelo_lda_2class <- lda(formula = cover_type ~. , data = subsubset)

la clasificacion ingenua:

predicciones_2class_0 <-  predict(object = modelo_lda_2class, newdata = subsubset)
table(subsubset$cover_type, predicciones_2class_0$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real    1    2
         1 1343  488
         2  525 1914
training_error_2cl_0 <- mean(subsubset$cover_type != predicciones_2class_0$class) * 100 
training_error_2cl_0
[1] 23.72365
predicciones_2class <-  predict(object = modelo_lda_2class, newdata = testeo_2class)
table(clase_2class, predicciones_2class$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2
         1 516 204
         2 199 812
training_error_2cl <- mean(clase_2class != predicciones_2class$class) * 100 
training_error_2cl
[1] 23.28134
library(klaR) 

partimat(subsubset$cover_type ~ ., data = subsubset, method = "lda", col.correct=NA, col.wrong=NA, plot.matrix=FALSE,image.colors =rainbow(2))

La variable que mejor separa es elevacion.

Con las variables estandarizadas:


library(MASS)

modelo_lda_2class_est <- lda(formula = cover_type ~. , data = subsubset_es)

la clasificacion ingenua:

predicciones_2class_es_0 <-  predict(object = modelo_lda_2class_est, newdata = subsubset_es)
table(subsubset_es$cover_type, predicciones_2class_es_0$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real    1    2
         1 1343  488
         2  525 1914
training_error_2cl_es_0 <- mean(subsubset_es$cover_type != predicciones_2class_es_0$class) * 100 
training_error_2cl_es_0
[1] 23.72365
library(clusterSim)

#subsubset_es <- scale(subsubset[,1:6], center = median(subsubset))
testeo_es <- data.Normalization (testeo_2class[,1:6],type="n2",normalization="column")
testeo_es <- cbind(testeo_es,testeo_2class$clase)
testeo_es <- rename(testeo_es, cover_type=`testeo_2class$clase`)
predicciones_2class_es <-  predict(object = modelo_lda_2class_est, newdata = testeo_es)
table(testeo_es$clase, predicciones_2class_es$class, dnn = c("Clase real", "Clase predicha"))
training_error_2cl_es <- mean(testeo_es$clase != predicciones_2class_es$class) * 100 
training_error_2cl_es
[1] 23.62796

Haciendo el discriminante robusto:


library(MASS)

modelo_qda_2class_est <- qda(formula = cover_type ~. , data = subsubset_es)
prediccionesqda_2c_es_0 <-  predict(object = modelo_qda_2class_est, newdata = subsubset_es)
table(subsubset_es$cover_type, prediccionesqda_2c_es_0$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real    1    2
         1 1397  434
         2  613 1826
training_errorqda_2cl_es_0 <- mean(subsubset_es$cover_type != prediccionesqda_2c_es_0$class) * 100 
training_errorqda_2cl_es_0
[1] 24.51991
prediccionesqda_2c_es <-  predict(object = modelo_qda_2class_est, newdata = testeo_es)
table(testeo_es$clase, prediccionesqda_2c_es$class, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2
         1 559 161
         2 252 759
training_errorqda_2cl_es <- mean(testeo_es$clase != prediccionesqda_2c_es$class) * 100 
training_errorqda_2cl_es
[1] 23.85904

Incluso empeora un poco.

Support Vector machine

Clasificando con svm el dataset con las 7 clases:

library(e1071)

model.svm = svm( subset$cover_type ~ ., data = subset[,c(1:5,7)], kernel = "radial", cost = 10, scale = TRUE)
print(model.svm)

Call:
svm(formula = subset$cover_type ~ ., data = subset[, c(1:5, 7)], kernel = "radial", cost = 10, 
    scale = TRUE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  10 

Number of Support Vectors:  3264
predicciones_svm = predict(model.svm, subset[,c(1:5,7)])

table(subset$cover_type, predicciones_svm, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real    1    2    3    4    5    6    7
         1 1342  465    0    0    0    0   24
         2  373 2035   24    0    0    4    3
         3    0   60  225    0    0    9    0
         4    0    0    7   13    0    1    0
         5    0   60    0    0   18    0    0
         6    0   66   46    0    0   35    0
         7   86    4    0    0    0    0  100
training_error_svm_0 <- mean(subset$cover_type != predicciones_svm) * 100 
training_error_svm_0
[1] 24.64

El kernel que mejor da es el radial.

predicciones_svm_te = predict(model.svm, testeo)

table(testeo$clase, predicciones_svm_te, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2   3   4   5   6   7
         1 499 209   0   0   0   0  12
         2 169 821  13   0   3   4   1
         3   0  26  76   1   0   5   0
         4   0   0   8   3   0   0   0
         5   0  29   0   0   4   0   0
         6   0  27  38   0   0   8   0
         7  37   3   0   0   0   0  29
training_error_svm <- mean(testeo$clase != predicciones_svm_te) * 100 
training_error_svm
[1] 28.88889

Si lo hacemos con menos variables:

subsubset1 <-subset[(subset$cover_type==1),]
subsubset2 <-subset[(subset$cover_type==2),]
subsubset <-rbind(subsubset1,subsubset2)
subsubset$cover_type <- droplevels(subsubset$cover_type)
library(e1071)

model.svm_2 = svm( subsubset$cover_type ~ ., data = subsubset[,c(1:5,7)], kernel = "radial", cost = 10, scale = TRUE)
print(model.svm)

Call:
svm(formula = subset$cover_type ~ ., data = subset[, c(1:5, 7)], kernel = "radial", cost = 10, 
    scale = TRUE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  10 

Number of Support Vectors:  3264
predicciones_svm_2 = predict(model.svm_2, subsubset[,c(1:5,7)])

table(subsubset$cover_type, predicciones_svm_2, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real    1    2
         1 1359  472
         2  374 2065
training_error_svm_2 <- mean(subsubset$cover_type != predicciones_svm_2) * 100 
training_error_svm_2
[1] 19.81265

Ahora con el test:

predicciones_svm_te_2 = predict(model.svm_2, testeo_2class)

table(testeo_2class$clase, predicciones_svm_te_2, dnn = c("Clase real", "Clase predicha"))
          Clase predicha
Clase real   1   2
         1 501 219
         2 170 841
training_error_svm_2_te <- mean(testeo_2class$clase != predicciones_svm_te_2) * 100 
training_error_svm_2_te
[1] 22.47256

Graficando para todas las clases:

plot(model.svm, data=subset[,c(1:5,7)],Elevation ~ Aspect,  color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Slope, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Slope, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Horizontal_Distance_To_Hydrology ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)],Elevation ~ Aspect,  color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Slope, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Slope, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Horizontal_Distance_To_Hydrology ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

Se dibujan mas fronteras cuando la variable Elevation esta presente.

Clasificacion no supervisada

Para la clasificacion no supervisada, comenzamos con metodos jerarquicos porque k-means es sensible a outliers:

datos_clust <- subset
mat_dist <- dist(x = subset, method = "euclidean") 

Dendrogramas para distintos metodos, suponemos 7 clusters:

hc_complete <- hclust(d = mat_dist, method = "complete") 
hc_average  <- hclust(d = mat_dist, method = "average")
hc_single   <- hclust(d = mat_dist, method = "single")
hc_ward     <- hclust(d = mat_dist, method = "ward.D2")
hc_cn     <- hclust(d = mat_dist, method = "centroid")

Construyendo los dendrogramas:

cantidad_clusters = 7

plot(hc_complete)
rect.hclust(hc_complete, k=cantidad_clusters, border="red") #


jer_complete<-cutree(hc_complete,k=cantidad_clusters)           #
datos_clust$jer_complete=jer_complete
cantidad_clusters = 7

plot(hc_average)
rect.hclust(hc_average, k=cantidad_clusters, border="red") #


jer_average<-cutree(hc_average,k=cantidad_clusters)           #
datos_clust$jer_average=jer_average
cantidad_clusters = 7

plot(hc_single)
rect.hclust(hc_single, k=cantidad_clusters, border="red") #


jer_single<-cutree(hc_single,k=cantidad_clusters)           #
#datos$jer_complete=jer_complete

Da feisimo…

cantidad_clusters = 7

plot(hc_ward)
rect.hclust(hc_ward, k=cantidad_clusters, border="red") #


jer_ward<-cutree(hc_ward,k=cantidad_clusters)           #
datos_clust$jer_ward=jer_ward
cantidad_clusters = 7

plot(hc_cn)
rect.hclust(hc_cn, k=cantidad_clusters, border="red") #


jer_cn<-cutree(hc_cn,k=cantidad_clusters)           #
datos_clust$jer_cn=jer_cn

Los coeficientes de correlacion cofenetica:

cor(x = mat_dist, cophenetic(hc_complete))
[1] 0.7768551
cor(x = mat_dist, cophenetic(hc_average))
[1] 0.7736942
cor(x = mat_dist, cophenetic(hc_single))
[1] 0.1769797
cor(x = mat_dist, cophenetic(hc_ward))
[1] 0.6320018
cor(x = mat_dist, cophenetic(hc_cn))
[1] 0.7774966

El linkage single es el peor.

datos_clust$jer_complete <- as.factor(datos_clust$jer_cn)
p1 <- ggplot(datos_clust, aes(x=Elevation, y=Aspect, color=jer_cn)) +
  geom_point() + scale_color_brewer(palette="Dark2")

p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=jer_cn)) +
  geom_point()
p2 + scale_color_brewer(palette="Dark2")

p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
  geom_point()
p3 + scale_color_brewer(palette="Dark2")

p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
  geom_point()
p4 + scale_color_brewer(palette="Dark2")
library(ggplot2)
library(easyGgplot2)

datos_clust$jer_complete <- as.factor(datos_clust$jer_cn)

p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=jer_cn, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none") +labs(y='H_Hydrology')


p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p7 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p9 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p10 <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(x='H_Hydrology',y='H_Roadways')

ggplot2.multiplot(p2,p3,p4,p7,p9,p10,cols=3)

png('jerarquical.png')

ggplot2.multiplot(p2,p3,p4,p7,p9,p10,cols=3)

dev.off()
png 
  2 

library(ggplot2)

p <- ggplot(datos_clust, aes(x=Aspect, y=Slope, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

p <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

p <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")


p <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")
library(ggplot2)

p <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

p <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=jer_complete)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

Solo lo hice con el primer metodo de cluster jerarquico que fue el que mejor dio el coeficiente cofrenetico.

library(cluster)

datos_kmeans = datos_clust[1:5]

cantidad_clusters=7

CL  = kmeans(scale(datos_kmeans),cantidad_clusters)
datos_kmeans$kmeans = CL$cluster
library(ggplot2)
library(easyGgplot2)

datos_kmeans$kmeans <- as.factor(datos_kmeans$kmeans)

p1 <- ggplot(datos_clust, aes(x=Elevation, y=Aspect, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none") +labs(y='H_Hydrology')

p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p5 <- ggplot(datos_clust, aes(x=Aspect, y=Slope, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p6 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none") +labs(y='H_Hydrology')

p7 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p8 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p9 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p10 <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(x='H_Hydrology',y='H_Roadways')

ggplot2.multiplot(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,cols=3)

png('jerarquical_2.png')

ggplot2.multiplot(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,cols=3)

dev.off()
png 
  2 

---
title: "Trabajo Práctico AID"
output: html_notebook
---
```{r include = FALSE}
knitr::opts_chunk$set(echo=FALSE)
```

# Descripcion base de datos

### Titulo de la base de datos:

Forest Covertype data, Remote Sensing and GIS Program Department of Forest Sciences College of Natural Resources Colorado State University.

El dataset consiste en medidas cartograficas de areas silvestres en el Bosque Nacional Roosevelt en el 
norte de Colorado en USA y el tipo de covertura vegetal presente. Se realizan en celdas  de 30 X 30 metros. 
Estos bosques tienen poca intervencion 
humana, por ende las areas caracterizan no usos (como agricultura, poblacion etc) sino clase de bosque 
predominante. En total son 12 medidas cartograficas y 7 grandes tipos de covertura vegetal.

El dataset:
	
Numero de columnas:
```{r}
length(df)
```

Numero de observaciones:  
```{r}

nrow(df)
```
### Atributos:

```{r}
names(df[,c(1:10,55)])
```
Otros atributos son cuatro areas silvestres y 40 tipos de suelo.

El tipo de atributo, las unidades y su descripcion:

Name                              |   Data Type       |      Measurement        |            Description
----------------------------------|-------------------|-------------------------|----------------------------------------
Elevation                         |   quantitative    |  meters                 |  Elevation in meters
Aspect                            |   quantitative    |  azimuth                |  Aspect in degrees azimuth
Slope                             |   quantitative    |  degrees                |  Slope in degrees
Horizontal_Distance_To_Hydrology  |   quantitative    |  meters                 |  Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology    |   quantitative    |  meters                 |  Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways   |   quantitative    |  meters                 |  Horz Dist to nearest roadway
Hillshade_9am                     |   quantitative    |  0 to 255 index         |  Hillshade index at 9am, summer solstice
Hillshade_Noon                    |   quantitative    |  0 to 255 index         |  Hillshade index at noon, summer soltice
Hillshade_3pm                     |   quantitative    |  0 to 255 index         |  Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points|   quantitative    |  meters                 |  Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns)|   qualitative     |  0 (absence) or 1 (presence)|  Wilderness area designation
Soil_Type (40 binary columns)     |   qualitative     |  0 (absence) or 1 (presence |  Soil Type designation
Cover_Type (7 types)              |   integer         |  1 to 7                     |  Forest Cover Type designation

Las cuatro areas silvestres:  	

* Rawah Wilderness Area
* Neota Wilderness Area
* Comanche Peak Wilderness Area
* Cache la Poudre Wilderness Area


Tipo de covertura forestal:	

* Spruce/Fir (picea -pino-/abeto)
* Lodgepole Pine (pino)
* Ponderosa Pine (pino)
* Cottonwood/Willow (sauces)
* Aspen (alamos)
* Douglas-fir (pino oregon)
* Krummholz (vegetación atrofiada y deformada, arbol "bandera")

### Missings

No hay missings.

```{r}
colSums(is.na(df))
```
###Variables categoricas###

Una de las variables consiste en medidas del tipo de suelo presente. En el dataset esta variable se mapeo a 40 one-hot variables. Cada variable corresponde a un tipo de suelo, catalogado con un codigo que incluye informacion de la zona climatica y zona geologica. La distribucion de esta variable es la siguiente :

```{r}
library(dplyr)

df_suelos <- df[,15:54]

for (j in 1:length(df_suelos)){
    for (i in 1:nrow(df_suelos)){
        if (df_suelos[i,j]==1){
            df_suelos[i,41]<-paste("Soil_Type",j,sep = "")
        }
    }
}
df_suelos <- rename(df_suelos, suelo=V41)

barplot(sort(table(df_suelos$suelo)),las=2, log="y", xlab = "suelo", col = "#1b98e0")
png('barsuelo.png')
barplot(sort(table(df_suelos$suelo)),las=2, log="y", xlab = "suelo", col = "#1b98e0")

dev.off()

```

El suelo mas abundante tiene el doble de frecuencia que el siguiente mas abundante y su frecuencia esta 4 ordenes de magnitud por encima del suelo menos frecuente. En vista de que son muchas variables individuales a ser consideradas y no pueden ser combinadas (no sin conocimiento experto en geologia) para reducir su numero. No sera considerada en el analisis.

Otra de las variables categoricas es la zona silvestre de la celda. La abundancia de zonas:

```{r}
library(dplyr)

df_areas <- df[,11:14]

for (j in 1:length(df_areas)){
    for (i in 1:nrow(df_areas)){
        if (df_areas[i,j]==1){
            df_areas[i,5]<-paste("Area",j,sep = "")
        }
    }
}
df_areas<- rename(df_areas, areas=V5)

barplot(sort(table(df_areas$areas)),las=2, xlab = "areas silvestres", col = "#F4A582")

png('barareas.png')

barplot(sort(table(df_areas$areas)),las=2, xlab = "areas silvestres", col = "#F4A582")

dev.off()
```
Las areas silvestres 3 y 4 son alrededor de 8 veces mas abundantes en el dataset que las 1 y 2. Tambien se podria usar para clasificar pero se escoge la clase de cobertura forestal.

Descripcion de la variable objetivo, las clases de cobertura vegetal:

```{r message=FALSE}
library(plotly)

labels = c('Spruce/Fir','Lodgepole Pine','Ponderosa Pine','Cottonwood/Willow','Aspen','Douglas-fir',
          'Krummholz')
values <- aggregate(df$Elevation~ df$Forest_Cover_Type, data=df, FUN=length)
fig <- plot_ly(type='pie', labels=labels, values=values$`df$Elevation`, 
               textinfo='label+percent',
               insidetextorientation='radial',showlegend = FALSE)
fig

#png('covertype.png')

#fig

#dev.off()

#pie(table(df$Forest_Cover_Type),main='Tipos de covertura')
#barplot(table(df$World.region), xlab = "Region", ylab = "Cantidad de Ciudades",         main="Ciudades por region",cex.names = .3, las=2) 
```

Se puede observar que los datos estan bastante desbalanceados. Esto incidiría en una regresión pero no en un LDA.

Resumen de los datos:

```{r}
df_max<-apply(df[,1:10],2,max)
df_min<-apply(df[,1:10],2,min)
df_n <- sweep(df[,1:10], 2, df_min, "-")
range <- df_max-df_min
df_n <- sweep(df_n, 2, range, "/")
boxplot(df_n,las=2)
```

Todas las distribuciones son sesgadas y tienen muchos outliers. Para clasificar, si se va a aplicar un LDA, hay que garantizar normalidad y que las medias sean distintas, pero se vera mas adelante. Tambien es importante garantizar que los atributos sean independientes.

La variable Aspect es bimodal y probablemente Slope es multimodal. Todas las variables tienen colas largas, excepto Hillshade_3pm. En todas las variables excepto en Elevation las clases siguen la misma tendencia. En esta variable las distribuciones para cada grupo estan centradas distinto:


```{r message=FALSE}
library(ggplot2)
library(easyGgplot2)


set.seed(42)

rows <- sample(nrow(df))
df_rand <- df[rows, ]
df_new <- df_rand[seq(1,500000,100),c(1:10)]
cover_type <- df_rand[seq(1,500000,100),55]
subset0 <- cbind(df_new,cover_type)
subset0$cover_type <- as.factor(subset0$cover_type)

# Histograma Elevation
plot1 <- ggplot2.histogram(data=subset0, xName='Elevation', groupName='cover_type', maintitle = "Histograma Elevation", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Elevation", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma Aspect
plot2 <- ggplot2.histogram(data=subset0, xName='Aspect',groupName='cover_type', maintitle = "Histograma Aspect", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Aspect", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma Slope
plot3 <- ggplot2.histogram(data=subset0, xName='Slope', groupName='cover_type',maintitle = "Histograma Slope", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Slope", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma h hydro
plot4 <-ggplot2.histogram(data=subset0, xName='Horizontal_Distance_To_Hydrology', groupName='cover_type',maintitle = "Histograma Horiz. distance to hydrology", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "H_hydrology", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma v hydro
plot5 <-ggplot2.histogram(data=subset0, xName='Vertical_Distance_To_Hydrology', groupName='cover_type',maintitle = "Histograma Vert. distance to hydrology", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "V_hydrology", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma h road
plot6 <-ggplot2.histogram(data=subset0, xName='Horizontal_Distance_To_Roadways', groupName='cover_type',maintitle = "Histograma Horiz. distance to roadways", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "H_roadways", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma hill 9
plot7 <-ggplot2.histogram(data=subset0, xName='Hillshade_9am', groupName='cover_type',maintitle = "Histograma Hillshade_9am", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Hillshade_9am", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma hill 12
plot8 <-ggplot2.histogram(data=subset0, xName='Hillshade_Noon', groupName='cover_type',maintitle = "Histograma Hillshade_noon", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Hillshade_noon", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma hill 3
plot9 <-ggplot2.histogram(data=subset0, xName='Hillshade_3pm', groupName='cover_type',maintitle = "Histograma Hillshade_3pm", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Hillshade_3pm", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma h firepoint
plot10 <-ggplot2.histogram(data=subset0, xName='Horizontal_Distance_To_Fire_Points', groupName='cover_type', maintitle = "Histograma Horiz. distance to firepoints", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "H_firepoints", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
#

ggplot2.multiplot(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10, cols=4)

png('histograms.png')

ggplot2.multiplot(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10, cols=3)

dev.off()
```

### Correlaciones


```{r}
library(ggcorrplot)

corr <- round(cor(subset0[,1:10]), 1)
p.mat<- cor_pmat(subset0[,1:10])

ggcorrplot(corr, lab = TRUE,
     outline.col = "white",lab_size = 3, tl.cex=5,method = "circle",hc.order = TRUE, p.mat = p.mat)

png('correlation.png')

ggcorrplot(corr, lab = TRUE,
     outline.col = "white",lab_size = 3, tl.cex=5, method = "circle",hc.order = TRUE, p.mat = p.mat)

dev.off()

```

Las variables hillshade estan relacionadas con aspect, slope y entre ellas. Esto tiene que ver con como se calculan las cantidades hillshade que son los sombreados que se dibujan sobre los mapas cartograficos. Se supone una iluminacion simulada que depende de la orientacion a la fuente de luz, la cual esta basada en las variables aspect y slope. Las cantidades distancia vertical y horizontal a cursos de agua tambien estan relacionadas y se calculan con slope y elevation. La distancia horizontal a caminos y a puntos de fuego tambien se calculan con la elevacion y estan relacionados entre si.

Con base en esto acoto las variables a Elevation, Aspect, Slope, Hillshade_noon, Horizontal_Distance_Roadways y Horizontal_Distance_to_Hydrology.


```{r message=FALSE, warning=FALSE, paged.print=FALSE}
library(GGally)
library(data.table)

color <- as.factor(subset[,7])
ggpairs(subset[,1:5], aes(color = color, alpha = 0.5),lower = list(combo = "count"),upper = "blank",)
```

Se pueden observar tambien las distribuciones de las clases, son sesgadas como se vio en los boxplots. La variable que podria servir mas para distinguir clases podria ser Elevation.


```{r}
subsubset <- subset[,c(1:4,6,8,11)]
```


### Analisis explotario con PCA
```{r message=FALSE}
library(ggbiplot)
library(plyr)

datos_pca <- prcomp(subset[,1:5,7],scale=TRUE)
p <- ggbiplot(datos_pca, obs.scale=0.01,alpha = 0.3,groups=subset$cover_type)
p <- p + xlim(-3, 2) + ylim(-2, 3)

plot(p)

png('pca.png')

plot(p)

dev.off()

```


# Clasificacion Supervisada

## Linear Discriminant Analysis

El primer metodo a implementar es LDA. Para que tenga validez el metodo deben cumplirse los supuestos de normalidad multivariada para cada nivel de cada variable y de homocedasticidad.

Para el test de normalidad multivariada usamos test de Shapiro:

```{r}
library(mvnormtest)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='1',1:6]))
shapitest <- test$p.value
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='2',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='3',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='4',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='5',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='6',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='7',1:6]))
shapitest <- append(shapitest,test$p.value)

shapitest
```
Rechaza normalidad para todos los niveles de todas las variables. Para el total:

```{r}
library(mvnormtest)
mshapiro.test(t(subset[,1:6]))
```

Si el nivel de significacion por defecto es de 0.05, rechaza la normalidad multivariada. 

El siguiente supuesto de homocedasticidad lo evaluamos con el test M de Box:

```{r}
library(biotools)

boxM(data=subsubset[,1:6],grouping=subsubset[,7])
```
Rechaza la homogeneidad de varianzas. Se tendria que realizar un LDA robusto (cuadratico).

Igual, a pesar de haber rechazado la normalidad hacemos el test de Hotelling para determinar si las medias de los grupos son distintas:

```{r}
library(Hotelling)

fitProd = hotelling.test(.~ cover_type, data = subsubset) 
fitProd
```
Por ende rechazamos que las medias sean iguales. El cover_type es discriminante.

# LDA

Habiendo determinado que la variable de clase es discriminante, calculamos el lda:

```{r}

library(MASS)

modelo_lda <- lda(formula = cover_type ~ Elevation + Aspect + Slope + Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Roadways + Hillshade_Noon, data = subset)

```

Ahora se prueba el modelo:

```{r}
predicciones <-  predict(object = modelo_lda, newdata = testeo, method = "predictive")
table(clase, predicciones$class, dnn = c("Clase real", "Clase predicha"))
```
Los errores de prediccion:

```{r}
training_error <- mean(clase != predicciones$class) * 100 
training_error 
```
El error da bastante grande...

Con una variable mas:

```{r}
library(klaR) 

testeo$clase <- as.factor(testeo$clase)

partimat(clase ~ ., data = testeo, method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),)
```
Sin esa variable:

```{r}
library(klaR) 

testeo$clase <- as.factor(testeo$clase)

partimat(clase ~ ., data = testeo[,c(1:5,7)], method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),scaled=TRUE)

png('partiplot.png')

partimat(clase ~ ., data = testeo[,c(1:5,7)], method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),scaled=TRUE)

dev.off()

```

Discrimina muy mal. Ahora, si transformo las variables con un log:

```{r}
library(IDPmisc)

df_trans <- log(subset[,1:6])
subset_trans <- cbind(df_trans,cover_type)
subset_t <- NaRV.omit(subsubset_trans)
subset_t$cover_type <- as.factor(subset_t$cover_type)
```


```{r}
library(MASS)

modelo_lda_trans <- lda(formula = cover_type ~. , data = subset_t)

```

```{r}
library(IDPmisc)
library(dplyr)

test_trans <- log(test)
test_t <- cbind(test_trans,clase)
test_t <- NaRV.omit(test_t)
clase_t <-test_t$clase
```


```{r}
predicciones_t <-  predict(object = modelo_lda_trans, newdata = test_t)
table(clase_t, predicciones_t$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_t <- mean(clase_t != predicciones_t$class) * 100 
training_error_t 
```
Pesimo.

Si considero una variable mas, sin transformar:


```{r}

library(MASS)

modelo_lda_2 <- lda(formula = cover_type ~. , data = subset_2)

```
```{r}
predicciones_2 <-  predict(object = modelo_lda_2, newdata = testeo_2)
table(clase, predicciones_2$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_2 <- mean(testeo_2$clase != predicciones_2$class) * 100 
training_error_2 
```
Considerar las variables con logaritmo o considerar una variable adicional no mejoro la prediccion. Se intentara ahora con un lda robusto:



```{r}

library(MASS)

modelo_qda <- qda(formula = cover_type ~ ., data = subset)

```

```{r}
predicciones_qda <-  predict(object = modelo_qda, newdata = testeo, method = "predictive")
table(clase, predicciones_qda$class, dnn = c("Clase real", "Clase predicha"))
```
```{r}
training_error_qda <- mean(clase != predicciones_qda$class) * 100 
training_error_qda
```
Da peor...

Con las clases escaladas:

```{r}
datos_escalados <- as.data.frame(scale(subset[,1:5]))
subset_esc <- cbind(datos_escalados,cover_type)
subset_esc$cover_type <- as.factor(subset_esc$cover_type)
```

test de Shapiro para los datos escalados:

```{r}
subset_esc$cover_type <- as.factor(subset_esc$cover_type)
```

```{r}
library(mvnormtest)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='1',1:5]))
shapitest <- test$p.value
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='2',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='3',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='4',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='5',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='6',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='7',1:5]))
shapitest <- append(shapitest,test$p.value)

shapitest
```

```{r}

library(MASS)

modelo_lda_esc <- lda(formula = cover_type ~., data = subset_esc)

```

```{r}
test_esc <- as.data.frame(scale(test))
test_esc <- cbind(test_esc,clase)
test_esc$clase <- as.factor(test_esc$clase)
```

```{r}
predicciones_esc <-  predict(object = modelo_lda_esc, newdata = test_esc, method = "predictive")
table(clase, predicciones_esc$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_esc <- mean(clase != predicciones_esc$class) * 100 
training_error_esc
```
El mejor resultado fue el inicial. En general se predicen muy mal las clases menos abundantes. 

El qda con los datos escalados:

```{r}

library(MASS)

modelo_qda_esc <- qda(formula = cover_type ~ ., data = subset[,1:5,7])

```

```{r}
predicciones_qda_esc <-  predict(object = modelo_qda_esc, newdata = testeo[,1:5,7], method = "predictive")
table(clase, predicciones_qda_esc$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_errorqda_esc <- mean(clase != predicciones_qda_esc$class) * 100 
training_errorqda_esc
```

Se realizara el analisis con solo dos clases:


```{r}
subsubset2 <-subset[(subset$cover_type==2),]
subsubset1 <-subset[(subset$cover_type==1),]
subsubset <-rbind(subsubset1,subsubset2)
subsubset$cover_type <- droplevels(subsubset$cover_type)
```

El test de normalidad multivariada:

```{r}
library(mvnormtest)

testing <- mshapiro.test(t(subsubset[subsubset$cover_type=='1',1:6]))
shapitest <- testing$p.value
testing <- mshapiro.test(t(subsubset[subsubset$cover_type=='2',1:6]))
shapitest <- append(shapitest,testing$p.value)

shapitest
```
Rechaza normalidad para cada variable para cada nivel de la variable.

Con las variables estandarizadas:

```{r}
library(clusterSim)

#subsubset_es <- scale(subsubset[,1:6], center = median(subsubset))
subsubset_es <- data.Normalization (subsubset[,1:6],type="n2",normalization="column")
subsubset_es <- cbind(subsubset_es,subsubset[,7])
subsubset_es <- rename(subsubset_es, cover_type=`subsubset[, 7]`)
```


Con las variables estandarizadas y sacando hillshade:

```{r}
library(mvnormtest)

subsubset_2 <- subsubset_es[,c(1:5,7)]
testing <- mshapiro.test(t(subsubset_2[subsubset_2$cover_type=='1',1:5]))
shapitest <- testing$p.value
testing <- mshapiro.test(t(subsubset_2[subsubset_2$cover_type=='2',1:5]))
shapitest <- append(shapitest,testing$p.value)

shapitest
```
Rechaza, sin sacar y sacando hillshade.

Testeando homocedasticidad:

```{r}
library(biotools)

boxM(data=subsubset[,1:6],grouping=subsubset[,7])
```
Rechaza igualdad de varianzas.

Como el test M de Box es sensible a la falta de normalidad, realizamos el de Levene:

```{r}
library(car)

leveneTest( Elevation + Aspect + Slope + Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Roadways + Hillshade_Noon ~ subsubset$cover_type, data = subsubset)
```

El valor es menor que el nivel de significancia 0.001. Rechazamos la hipotesis nula y concluimos que las varianzas no son iguales.

De todas maneras hacemos el test de Hotelling para testear diferencia de medias:

```{r}
library(Hotelling)

fitProd = hotelling.test(.~ subsubset$cover_type, data = subsubset) 
fitProd
```
Rechaza igualdad de medias, pero como no se cumple la normalidad multivariada este resultado no es confiable.

# LDA con dos clases

el conjunto de test para dos clases:

```{r}
library(dplyr)

testeo2 <-testeo[(testeo$clase==2),]
testeo1 <-testeo[(testeo$clase==1),]
testeo_2class <-rbind(testeo1,testeo2)
testeo_2class$clase <- droplevels(testeo_2class$clase)
clase_2class <- testeo_2class[,7]
test_2class <-testeo_2class[,1:6]

```


```{r}

library(MASS)

modelo_lda_2class <- lda(formula = cover_type ~. , data = subsubset)

```

la clasificacion ingenua:

```{r}
predicciones_2class_0 <-  predict(object = modelo_lda_2class, newdata = subsubset)
table(subsubset$cover_type, predicciones_2class_0$class, dnn = c("Clase real", "Clase predicha"))
```
```{r}
training_error_2cl_0 <- mean(subsubset$cover_type != predicciones_2class_0$class) * 100 
training_error_2cl_0
```

```{r}
predicciones_2class <-  predict(object = modelo_lda_2class, newdata = testeo_2class)
table(clase_2class, predicciones_2class$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_2cl <- mean(clase_2class != predicciones_2class$class) * 100 
training_error_2cl
```

```{r}
library(klaR) 

partimat(subsubset$cover_type ~ ., data = subsubset, method = "lda", col.correct=NA, col.wrong=NA, plot.matrix=FALSE,image.colors =rainbow(2))
```
La variable que mejor separa es elevacion.


Con las variables estandarizadas:

```{r}

library(MASS)

modelo_lda_2class_est <- lda(formula = cover_type ~. , data = subsubset_es)

```

la clasificacion ingenua:

```{r}
predicciones_2class_es_0 <-  predict(object = modelo_lda_2class_est, newdata = subsubset_es)
table(subsubset_es$cover_type, predicciones_2class_es_0$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_2cl_es_0 <- mean(subsubset_es$cover_type != predicciones_2class_es_0$class) * 100 
training_error_2cl_es_0

```
```{r}
library(clusterSim)

#subsubset_es <- scale(subsubset[,1:6], center = median(subsubset))
testeo_es <- data.Normalization (testeo_2class[,1:6],type="n2",normalization="column")
testeo_es <- cbind(testeo_es,testeo_2class$clase)
testeo_es <- rename(testeo_es, clase=`testeo_2class$clase`)
```


```{r}
predicciones_2class_es <-  predict(object = modelo_lda_2class_est, newdata = testeo_es)
table(testeo_es$clase, predicciones_2class_es$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_2cl_es <- mean(testeo_es$clase != predicciones_2class_es$class) * 100 
training_error_2cl_es

```
Haciendo el discriminante robusto:

```{r}

library(MASS)

modelo_qda_2class_est <- qda(formula = cover_type ~. , data = subsubset_es)

```

```{r}
prediccionesqda_2c_es_0 <-  predict(object = modelo_qda_2class_est, newdata = subsubset_es)
table(subsubset_es$cover_type, prediccionesqda_2c_es_0$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_errorqda_2cl_es_0 <- mean(subsubset_es$cover_type != prediccionesqda_2c_es_0$class) * 100 
training_errorqda_2cl_es_0

```
```{r}
prediccionesqda_2c_es <-  predict(object = modelo_qda_2class_est, newdata = testeo_es)
table(testeo_es$clase, prediccionesqda_2c_es$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_errorqda_2cl_es <- mean(testeo_es$clase != prediccionesqda_2c_es$class) * 100 
training_errorqda_2cl_es

```
Incluso empeora un poco.

## Support Vector machine

Clasificando con svm el dataset con las 7 clases:

```{r}
library(e1071)

model.svm = svm( subset$cover_type ~ ., data = subset[,c(1:5,7)], kernel = "radial", cost = 10, scale = TRUE)
print(model.svm)

```
```{r}
predicciones_svm = predict(model.svm, subset[,c(1:5,7)])

table(subset$cover_type, predicciones_svm, dnn = c("Clase real", "Clase predicha"))
```
```{r}
training_error_svm_0 <- mean(subset$cover_type != predicciones_svm) * 100 
training_error_svm_0
```
El kernel que mejor da es el radial. 

```{r}
predicciones_svm_te = predict(model.svm, testeo)

table(testeo$clase, predicciones_svm_te, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_svm <- mean(testeo$clase != predicciones_svm_te) * 100 
training_error_svm
```

Si lo hacemos con menos variables:

```{r}
subsubset1 <-subset[(subset$cover_type==1),]
subsubset2 <-subset[(subset$cover_type==2),]
subsubset <-rbind(subsubset1,subsubset2)
subsubset$cover_type <- droplevels(subsubset$cover_type)
```


```{r}
library(e1071)

model.svm_2 = svm( subsubset$cover_type ~ ., data = subsubset[,c(1:5,7)], kernel = "radial", cost = 10, scale = TRUE)
print(model.svm)

```


```{r}
predicciones_svm_2 = predict(model.svm_2, subsubset[,c(1:5,7)])

table(subsubset$cover_type, predicciones_svm_2, dnn = c("Clase real", "Clase predicha"))
```
```{r}
training_error_svm_2 <- mean(subsubset$cover_type != predicciones_svm_2) * 100 
training_error_svm_2
```
Ahora con el test:

```{r}
predicciones_svm_te_2 = predict(model.svm_2, testeo_2class)

table(testeo_2class$clase, predicciones_svm_te_2, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_svm_2_te <- mean(testeo_2class$clase != predicciones_svm_te_2) * 100 
training_error_svm_2_te
```
Graficando para todas las clases:

```{r}
plot(model.svm, data=subset[,c(1:5,7)],Elevation ~ Aspect,  color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Slope, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Slope, color.palette=topo.colors)
```

```{r}
plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Horizontal_Distance_To_Hydrology ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

```
```{r}
plot(model.svm_2, data=subsubset[,c(1:5,7)],Elevation ~ Aspect,  color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Slope, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Slope, color.palette=topo.colors)
```

```{r}
plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Horizontal_Distance_To_Hydrology ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

```


Se dibujan mas fronteras cuando la variable Elevation esta presente.

## Clasificacion no supervisada

Para la clasificacion no supervisada, comenzamos con metodos jerarquicos porque k-means es sensible a outliers:

```{r}
datos_clust <- subset
mat_dist <- dist(x = subset, method = "euclidean") 

```

Dendrogramas para distintos metodos, suponemos 7 clusters:

```{r}
hc_complete <- hclust(d = mat_dist, method = "complete") 
hc_average  <- hclust(d = mat_dist, method = "average")
hc_single   <- hclust(d = mat_dist, method = "single")
hc_ward     <- hclust(d = mat_dist, method = "ward.D2")
hc_cn     <- hclust(d = mat_dist, method = "centroid")
```

Construyendo los dendrogramas:

```{r}
cantidad_clusters = 7

plot(hc_complete)
rect.hclust(hc_complete, k=cantidad_clusters, border="red") #

jer_complete<-cutree(hc_complete,k=cantidad_clusters)           #
datos_clust$jer_complete=jer_complete

```
```{r}
cantidad_clusters = 7

plot(hc_average)
rect.hclust(hc_average, k=cantidad_clusters, border="red") #

jer_average<-cutree(hc_average,k=cantidad_clusters)           #
datos_clust$jer_average=jer_average
```

```{r}
cantidad_clusters = 7

plot(hc_single)
rect.hclust(hc_single, k=cantidad_clusters, border="red") #

jer_single<-cutree(hc_single,k=cantidad_clusters)           #
#datos$jer_complete=jer_complete
```

Da feisimo...

```{r}
cantidad_clusters = 7

plot(hc_ward)
rect.hclust(hc_ward, k=cantidad_clusters, border="red") #

jer_ward<-cutree(hc_ward,k=cantidad_clusters)           #
datos_clust$jer_ward=jer_ward
```


```{r}
cantidad_clusters = 7

plot(hc_cn)
rect.hclust(hc_cn, k=cantidad_clusters, border="red") #

jer_cn<-cutree(hc_cn,k=cantidad_clusters)           #
datos_clust$jer_cn=jer_cn
```


Los coeficientes de correlacion cofenetica:

```{r}
cor(x = mat_dist, cophenetic(hc_complete))
cor(x = mat_dist, cophenetic(hc_average))
cor(x = mat_dist, cophenetic(hc_single))
cor(x = mat_dist, cophenetic(hc_ward))
cor(x = mat_dist, cophenetic(hc_cn))
```
El linkage single es el peor.

```{r}
datos_clust$jer_complete <- as.factor(datos_clust$jer_cn)
p1 <- ggplot(datos_clust, aes(x=Elevation, y=Aspect, color=jer_cn)) +
  geom_point() + scale_color_brewer(palette="Dark2")

p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=jer_cn)) +
  geom_point()
p2 + scale_color_brewer(palette="Dark2")

p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
  geom_point()
p3 + scale_color_brewer(palette="Dark2")

p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
  geom_point()
p4 + scale_color_brewer(palette="Dark2")
```


```{r}
library(ggplot2)
library(easyGgplot2)

datos_clust$jer_complete <- as.factor(datos_clust$jer_cn)

p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=jer_cn, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none") +labs(y='H_Hydrology')


p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p7 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p9 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p10 <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(x='H_Hydrology',y='H_Roadways')

ggplot2.multiplot(p2,p3,p4,p7,p9,p10,cols=3)

png('jerarquical.png')

ggplot2.multiplot(p2,p3,p4,p7,p9,p10,cols=3)

dev.off()
```

```{r}
library(ggplot2)

p <- ggplot(datos_clust, aes(x=Aspect, y=Slope, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

p <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

p <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")


p <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

```
```{r}
library(ggplot2)

p <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

p <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=jer_complete)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")


```
Solo lo hice con el primer metodo de cluster jerarquico que fue el que mejor dio el coeficiente cofrenetico.




```{r echo=TRUE}
library(cluster)

datos_kmeans = datos_clust[1:5]

cantidad_clusters=7

CL  = kmeans(scale(datos_kmeans),cantidad_clusters)
datos_kmeans$kmeans = CL$cluster
```

```{r}
library(ggplot2)
library(easyGgplot2)

datos_kmeans$kmeans <- as.factor(datos_kmeans$kmeans)

p1 <- ggplot(datos_clust, aes(x=Elevation, y=Aspect, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none") +labs(y='H_Hydrology')

p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p5 <- ggplot(datos_clust, aes(x=Aspect, y=Slope, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p6 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none") +labs(y='H_Hydrology')

p7 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p8 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p9 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p10 <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(x='H_Hydrology',y='H_Roadways')

ggplot2.multiplot(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,cols=3)

png('jerarquical_2.png')

ggplot2.multiplot(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,cols=3)

dev.off()
```


